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Abstract 

Background: Next Generation Sequencing techniques are producing enormous amounts of biological sequence 
data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of 
conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive 
alignment, whose run time grows with the square of the number and the length of the aligned sequences and 
requires significant computational resources. In this work, we present a method to efficiently discover regions of high 
similarity across multiple sequences without performing expensive sequence alignment. The method is based on 
approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high- 
throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi- 
alignments have numerous applications such as identifying species and their taxonomic class from sequences, 
comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. 

Results: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple 
sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 
16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved 
regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the 
proposed method scales well for large data sets with a run time that grows only linearly with the number and length 
of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. 

Conclusion: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across 
multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, 
we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many 
potential applications such as finding characteristic signature sequences for families of organisms and studying 
conserved and variable regions in, for example, 16S rRNA. 



Background 

With the development of Next Generation Sequencing 
techniques, there has been a massive increase in the num- 
ber of sequences available. Analyzing such a large volume 
of sequence data presents a major computational chal- 
lenge, especially since it often involves finding an optimal 
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alignment between the sequences as a first step. Multiple 
Sequence Analysis (MSA), which is most commonly used 
for aligning a set of sequences, is computationally very 
expensive. In many bioinformatics applications (e.g., 
BLAST [1], BAlibase [2], T-Coffee [3], MAFFT [4], 
MUSCLE [5,6], Kalign [7] and ClustalW2 and ClustalX2 
[8]), sequence alignment and MSA play a critical role. 
Finding the optimal alignment for a large set of sequences 
that may be related by function, evolution, or structure is 
a computationally complex task and often involves use of 
high performance computing servers and resources. 
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Alternative approaches involve creating statistical sig- 
natures from nucleotide composition frequencies. These 
so-called alignment-free methods [9] are more efficient 
both in terms of processing time and storage require- 
ments as they work with compact signatures and not 
the entire set of sequences. These methods also scale 
well for whole genome phylogenetic analysis [10] which 
is an improvement over existing methods. However, 
these methods do not retain any of the important local 
information, such as the GC content in particular areas 
of the sequences. Several other methods have tried to 
characterize sequences based on the repeated presence 
of certain shorter patterns [11,12]. These methods are 
mostly used as heuristics for identification of smaller set 
of sequences. 

It is well known that different regions of DNA sequences 
have different roles. For example, some regions are 
responsible for protein coding and are known as coding 
regions [13] while others are conserved across related spe- 
cies and can be an indication of evolutionary similarity. 
Thus, a flexible approach to sequence analysis is needed 
that can take advantage of computational efficiency of 
alignment-free techniques while still taking into account 
the unique properties of dififerent regions of sequences. 

In this work, we propose a novel method for discover- 
ing conserved regions across multiple sequences. Our 
method is based on position-sensitive word frequency 
analysis and uses high efficiency data stream clustering to 
find regions with similar word frequency distributions 
across multiple sequences. We refer to clusters as quasi- 
alignments, since a similar word frequency distribution in 
a cluster also means that the underlying segments are 
likely to be very similar. For the found clusters, we also 
retain important metadata such as the position of the 
clustered segments in the original sequences and the GC- 
content. This approach has previously been used for phy- 
logenetic classification [14]. This paper expands our 
preliminary investigation into discovering similar seg- 
ments [15] by developing a more rigorous theory of 
quasi-alignment, improved visualization and an expan- 
sion to the species level. We show how quasi-alignment 
can be used to quickly and efficiently discover conserved 
regions across multiple sequences. Finding stretches of 
identical sequences at the species level is useful for var- 
ious applications including sequence identification and 
DNA barcoding [16]. 

Quasi-alignment via position-sensitive p-mer 
clustering 

Constructing position-sensitive frequency vectors 

In this section, we present the foundation for the analysis 
and clustering framework. The basic unit of analysis for 
our case is a word inside a segment of a sequence. In the 
case of global frequency analysis used by alignment-free 



methods [9] , a word of length p is referred to as an oligo- 
mer or /7-mer. In case of DNA sequences, the words are 
formed from the set of alphabet {A, C, T, G}. In our ana- 
lysis, we are interested in the distribution of words within 
specific regions (segments) of a sequence. 

Definition 1. Given a DNA sequence x of length L, a 
segment Sij is defined as a subsequence starting at posi- 
tion i and having length I, where I < L - (i - 1). 

Next, we define the distribution of words within a 
segment. 

Definition 2. Given a segment Si^u we define NSVi^i = 
(fit fit ■ ■ ■ 1 fip) as a vector of length 4F where each ele- 
ment fi, i = {1, 2, . . . , 4^}, represents the count of a pos- 
sible p-mer in the segment. This vector represents the 
word frequency distribution in Sij and is referred to the 
segment's Numerical Summarization Vectors (NSV). 

The set of NSVs for an entire sequence can be created 
by partitioning it into equal sized segments that may or 
may not overlap. For example, a sequence of length 1500 
base pairs (bp) can be divided into 15 segments each of 
length 100 without any overlap between them. The word 
frequency distribution will later be used to find similar 
segments. Thus, the word size parameter p controls how 
well the similarity between segments is approximated 
with larger values leading to better approximation while 
smaller values lead to faster computation. We find that 
p = 3, i.e. we count the occurrence of tri-mers within a 
segment, produces good results while creating NSVs of 
length 4^ or 64. Cutting a sequence into segments and 
then creating NSVs is illustrated in Figure 1. We cur- 
rently do not take into account any unknown characters, 
such as "N" that may be present due to sequencing errors 
or ambiguous sequencing. 

Data stream clustering 

After constructing NSVs for the entire set of sequences 
to be examined, we use high-throughput data stream 
techniques [17] to cluster similar segments. We consider 
each NSV as a data point in a stream of consecutive 
NSVs. The clustering algorithm then adds one NSV 
after the other to the cluster model by adding it to an 
existing cluster if it is within a user defined clustering 
threshold from its center or otherwise creating a new 
cluster with the NSV as its first member. This idea is 
illustrated in Figure 1. In addition to the clusters we 
also retain order information in the form of a directed 
graph (shown in Figure 1). The exact clustering proce- 
dure is discussed in [18]. 

Definition 3. A GenModel M is defined as a directed 
graph G = {C, E)where the vertices are the set of clusters 
C = {Ci, C2, Cm] of NSVs and the edges E represent 
the ordering of the NSVs in the sequences. Each cluster 
contains metadata such as location in the original 
sequence and the sequence IDs. 
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Figure 1 Process of creating a GenModel. GenModel is created by dividing tlie sequence into equal sized segments and evaluating word 
frequency distributions, called Numerical Summarization Vectors (NSVs), for each segment. This example shows word size of 3. The model is 
created by comparing each new segment's frequency profile with existing clusters. The segment is assigned to the closest cluster that is within 
a threshold distance, if no such cluster is available, a new cluster is created with the segment as the first member. Here NSV3 and NS\/4 are 
close enough to be assigned to the same cluster. 



We can now reframe the problem of alignment as the 
problem of finding similar subsequences (segments) via 
clustering. 

Definition 4. Each cluster C in a GenModel M repre- 
sents a set of similar segments and is referred to as a 
quasi-alignment and the segments are said to be quasi- 
aligned. 

By using clustering, we avoid the expensive alignment 
process and yet obtain information about local similarity 
between multiple sequences. The data stream clustering 
algorithm makes just one single pass through the seg- 
ments and thus it has a linear time complexity in terms of 
the length and number of sequences. Also, new sequences 
can be added very efficiently to an existing model. 

Similarity measures for clustering 

Clustering algorithms use similarity measures for com- 
paring individual data points. In our case, the sequence 
data is converted into fixed dimension NSVs represent- 
ing the word frequency distribution within segments. 
NSVs can be compared using standard measures for 
vectors such as Manhattan distance, Euclidean distance, 
squared Euclidean distance, Kullback-Leibler discre- 
pancy and Mahalanobis distance [9]. They can also be 
compared using other measures such as the number of 
shared words between two segments. For example, Sim- 
rank [19] compares the number of matching p-mers 
(typically with p = 7) for fast sequence search. 

The distance between NSVs can be related back to the 
difference between the original sequence strings also. 
The difference between two sequences is measured in 
terms of edit distance [20], which is the minimum num- 
ber of point mutations required to change one sequence 



into another. A point mutation can be an insertion, dele- 
tion, or substitution. Ukkonen [21] has proposed that the 
edit distance between two strings can be approximated 
by the Manhattan distance between their ^-gram profiles 
(which in our case will be the /?-mer profile). The Man- 
hattan distance between two frequency vectors x and y 
obtained from the segments s^ and Sy is defined as: 

4'' 

dManhattan(^,}') = ^ \xi - Yi\ (1) 
i=l 

The Manhattan distance simply computes the number 
of p-mevs that are different between the two sequences. 
It can be shown that the Manhattan distance gives a 
lower bound for the edit distance between the two seg- 
ments. 

, , ^ ^^Manhattan (^/ y) 
4dit(Sxv5y) > 2p^ 

The reasoning behind the bound above is that any 
insertion, deletion, or substitution in the segment will 
create at most p new p-raevs and destroy p existing 
/j-mers. Note that in theory it is possible to create two 
completely different sequences with the same q-gvzm 
profile (see [21]), however, this is very unlikely if we 
deal with biological sequences which are expected to 
have a certain degree of similarity (e.g., caused by con- 
served regions or homology). 

The relationship in equation 2 can be used to determine 
a reasonable clustering threshold for the data stream clus- 
tering algorithm in [18] for a given word size p. For exam- 
ple, we often use a segment size of 100 bases with 3-mers 
and a Manhattan clustering threshold of 30. Equation 2 
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shows that this threshold means that the edit distance 
between two segments needs to be at least 30/6 = 5 to put 
them into two separate clusters. Note also that position 
specific p-mer frequency clustering is not restricted to 
using Manhattan distance, it can be used with any proxi- 
mity measure defined on the frequency counts in NSVs. 

Discovery of conserved regions from GenlVlodels 

GenModels provide vital information about the similar- 
ity between segments in the form of quasi-alignments. 
This allows us to identify regions that are highly similar 
across multiple sequences. For sequences related by evo- 
lution, such as those from the same taxonomic unit, 
these segments are known as conserved regions. They 
are likely to be responsible for a particular function or 
provide a needed structural characteristic. 

As an illustration, Figure 2 shows a GenModel created 
from the 16S rRNA sequences belonging to the phylum 
GN06. We used the unaligned version of the sequences 
from the Greengenes database [22]. The available 13 
sequences range in length between 1374 and 1525 bases. 
For building the model, these sequences were broken 
down into non-overlapping segments of size / = 100 
bases each, which were then aggregated with 3-mers 
= 3) and clustered using Manhattan distance and a 
clustering threshold of 30. The resulting GenModel con- 
tains 54 clusters or quasi-alignments. The plot shows 
each of the quasi-alignment as circles uniquely identified 
by an id. The circle size is proportional to the size of 
the clusters (i.e., number of segments participating in 
the quasi-alignment) and arrows represent the direction 
of the transitions between them. A stronger arrow indi- 
cates that the transition occurs with a higher probability. 
For example, Figure 2 shows that one of the common 
transition paths is the quasi-alignment sequence 26 —> 
27 ^ 28 ^ 4 ^ . . . ^ 12 ^ 13 ^ 14 indicating that a 
large fraction of the sequences share some common 
sequence segments. In addition the plot shows that 
almost all sequences go through a few quasi-alignments 
(e.g., 4, 6, 10 and 14) which represent candidates for 
regions that may be highly conserved in the set of ana- 
lyzed sequences. The sequences share a common higher 
level taxonomy (phylum) and thus we expect relatively 
stronger quasi-alignments as compared to random 
sequences. Interesting in Figure 2 is the almost comple- 
tely separate path of smaller clusters starting with quasi- 
alignment number 16. This indicates that a single or a 
few sequences are significantly different from the major- 
ity of the sequences in the set. This might be due to sev- 
eral reasons, such as mis-classification of the sequence, 
or a sequencing error whereby some of the initial bases 
may have been removed. 

In Figure 3, we visualize the largest quasi-alignments 
found in the GenModel along the approximately 1500 



bases (x-axis). The top part of Figure 3 shows the seg- 
ments grouped into the 5 most popular quasi-align- 
ments as red horizontal lines. In this model, all red 
horizontal lines are exactly 100 bases long because a 
segment length of 100 was used. The segments that are 
part of the same quasi-alignment are joined by vertical 
dotted lines and the cluster id from Figure 2 is shown 
on top. We see that the well preserved segments are 
found in quasi-alignment 4, 6, 31, 10 and 14 which cor- 
respond to the largest clusters in the model where 
almost all sequences converge in Figure 2. The lower 
part of Figure 3 shows a measure of consensus for each 
segment i.e. proportion of sequences clustered into the 
most popular quasi-alignment. For example, it shows 
that all of the sequences in the nucleotide region 900- 
1000 converge in quasi-alignment 10. Similarly, all 
except one sequence converge in quasi-ahgnment 4, 6 
and 14 for the nucleotide regions 300-400, 500-600 and 
1300-1400, respectively. This is an indication that these 
segments are highly similar and could be conserved 
regions of the sequences. 

To validate our claim that the segments that are clus- 
tered together into a quasi-alignment are indeed highly 
similar, we performed traditional Multiple Sequence 
Alignment (MSA) on the segments that are part of quasi- 
alignment 10. We used Clustal [8] available through the 
software JalView [23,24] to perform MSA and visualize the 
alignment in Figure 4. The results show that the segments 
have an average pairwise alignment score of 0.94 (out of a 
maximum possible of 1.00) with a large majority of seg- 
ments being almost identical and having 100% pairwise 
alignment. Figure 4 and the MSA results indicate that the 
segments in quasi-alignment 10 are indeed very similar. 
We have performed a similar analysis on the other quasi- 
alignment shown in Figure 3 and verified that the seg- 
ments have a high degree of base-wise identity. 

It is interesting to note that in Figure 4 some sequences 
have bases that are "shifted" by a certain amount. For 
example, the first sequence shown with id 159470, has its 
bases shifted to the right by 29 bases. This can be the 
result of insertions/deletions (indels) in the sequences due 
to evolutionary processes. If this offset becomes too large, 
then it can interfere with clustering segments. This pro- 
blem can be removed by using overlapping segments, i.e., 
considering many or all possible offsets. This increases the 
time complexity in the worst case by a fixed factor of / 
(segment length). It also makes makes visualizing quasi- 
alignments more complicated and therefore we will 
restrict the discussion in this paper to non-overlapping 
segment. 

Further validation of the location of the found con- 
served regions can be obtained by looking at biological 
evidence available in the literature. Studies have reported 
that 16S rRNA contains regions that are highly conserved 



Nagar and Hahsler BMC Bioinformatics 2013, 14(Suppl 1 1):S2 
httpy/www.biomedcentral.com/1 471-21 05/1 4/S1 1/S2 



Page 5 of 12 




Figure 2 GenModel for the phylum GN06. GenModel of 13 16S rRNA sequences from the phylum GN06 with the circles denoting quasi- 
alignments (clusters of segments) with a unique id. Arrows show the preserved order information of segments in the original sequences. For 
example, all sequences with a segment participating in quasl-alignment 31 have the next segment in quasi-alignment 32. 



within each species, but variable between species. These 
regions are known as hypervariable regions [25,26], 
which are characteristic for each species, and have apph- 
cations such as PCR amplification using universal 
primers [25] . It has also been reported that hypervariable 
regions are flanked on both sides by regions that are 
highly conserved across multiple species [27,28]. These 
flanking regions are conserved even for sequences 



exhibiting wide genomic diversity such as environmental 
or biological samples. 

Nine identified hypervariable regions in 16S rRNA 
consist of nucleotides number 69-99, 137-242, 433-497, 
576-682, 822-879, 986-1043, 1117-1173, 1243-1294 and 
1435-1465, and are denoted by VI through V9, respec- 
tively [25]. The sequence data of the phylum GN06 con- 
tains 13 sequences potentially from multiple species. 
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Figure 3 Plot of the top 5 quasi-alignments from the phylum GN06. The top plot shows the positions of the segments that belong to the 
five strongest quasi-alignments. These segments indicate well preserved regions in the sequences. The bottom plot shows the consensus 
among the quasi-alignments for the segment, i.e., fraction of sequences participating in the most common quasi-alignment for each segment. 



which in the data have not been identified and hence 
are coded as unknown in the Greengenes database. 
Since the data contains several species, we would expect 
greater variations in the hypervariable regions than in 
the flanking, preserved regions. The top part of Figure 3 
shows the 5 largest quasi-alignments (clusters) found for 
the GenModel for the sequences from GN06. The posi- 
tions of the hypervariable regions are shown as blue 
lines labeled VI through V9 at the top of the plot. It is 
very clear that our algorithm identifies regions that 



flank the hypervariable regions. For example, quasi- 
alignment 10 covers the nucleotide bases between 
the hypervariable regions V5 and V6. Similarly, quasi- 
alignment 4, 6, 31, and 14 cover the bases between 
hypervariable regions V2, V3, V4, V5, V6, V8 and V9. 
The plot in the lower part of Figure 3 also confirms this 
finding. The peaks of the plots indicate those segments 
that have a high consensus i.e. a strong quasi-alignment. 
The region between bases 900 and 1000 share a perfect 
consensus, i.e., all the segments belong to the same 
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Figure 4 Multiple sequence alignment of segments that are part of quasi-alignment 10 in GN06 Multiple sequence alignment of the 
segments forming quasi-alignment 10 (at positions 901-1000) in the GenlVlodel for GN06 (visualized with JalView [23]). 
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quasi-alignment. As discussed earlier, this area lies 
between hypervariable regions and is thus is expected to 
be more conserved for a sample containing multiple 
species. 

Implementation details 

We have implemented an open source software package 
using the R framework called QuasiAlign which can be 
downloaded from http://r-forge.r-project.org/proiects/ 
mmsa/. This package has methods to quickly load a large 
set of sequence files, that can be in FASTA format with 
Greengenes [22] annotations, into a relational database 
and can be used to easily filter sequences belonging to 
any taxonomic rank. This package is built on top of a 
number of other packages including Biostrings [29] for 
handling sequences, and the data stream clustering pack- 
age rEMM [30,31]. It provides a complete interface for 
managing sequences, creating word frequencies distribu- 
tions (NSVs) and creating and analyzing GenModels. 
Several other useful functions, such as those for metage- 
nomic classification, are also available. More details can 
be obtained firom the package documentation [32]. 

For the analysis in this paper, we have used the default 
parameters for creating NSVs and GenModels. The 
parameters for NSVs are a segment length of / = 100 
bases with no overlap between and segments and a 
word size of p = 3. For creating GenModels, the default 
is Manhattan distance with a clustering threshold dis- 
tance of 30 which requires a minimum edit distance of 
5 between two segments to place the into separate clus- 
ters (see equation 2 above). 

Large scale experiments 

Position sensitive p-mer clustering can work with a set 
of DNA/RNA sequences or even fragments of sequences 
from any source. This is a valuable asset for metage- 
nomic analysis and also fits in nicely with the require- 
ments of Next Generation Sequencing methods. All 
experiments in this paper can be reproduced using the 
QuasiAlign [32] package. 

Dataset used 

The method presented here for discovering conserved 
regions is general enough to be applied to any set of 
sequences. For this analysis, we used the more than 
400,000 168 rRNA sequences obtained from the Green- 
genes project [22]. The 16S gene is widely used for phy- 
logenetic analysis as it is highly conserved for different 
species of bacteria and archaea. The sequences of this 
gene have remained more or less constant over time 
and evolutionary cycles. Further, it contains several dis- 
tinct regions, known as hypervariable regions, that are 
very specific and unique for each individual species [25] 
and are widespread used for sequence identification and 



classification. The package QuasiAlign allows us to 
directly import FASTA sequences with Greengenes 
annotations into a relational database for further 
analysis. 

Results 

We processed the entire Greengenes 16S rRNA database 
using the default settings for creating NSVs and Gen- 
Models and then analyzed the models for interesting 
patterns and clusters to search for highly similar or con- 
served regions across multiple sequences that may be 
related by taxonomy. 

As an example, we present an analysis of the species 
Leptotrichia buccalis that belongs to the phylum Fuso- 
bacteria and genus Leptotrichia. The database contains 
11 sequences of this species having lengths between 
1310 and 1510 bases. We ran the position sensitive 
p-mer: clustering algorithm on these sequences. The plot 
of the GenModel is shown in Figure 5 and the plot of 
the segments belonging to the 5 largest quasi-alignment 
is shown in Figure 6. It is easy to see that most 
sequences follow a similar path with the exception of 
one sequence that starts a totally different path starting 
from quasi-alignment 20 and ending at quasi-alignment 
32. This outlier sequence may have its bases shifted by a 
certain amount due to insertions or deletions giving it 
somewhat of a different frequency profile. Since the 
sequences belong to the same species, the hypervariable 
regions are expected to be highly conserved. The top 5 
quasi-alignments contain segments that belong to hyper- 
variable regions V2, V3, and V4. The bottom part of 
Figure 6 shows that the consensus of quasi-alignments 
peaks at the hypervariable regions. To check these 
results, we also performed MSA using Clustal [8] on the 
segments belonging to quasi-alignments 2 and 3. The 
average pairwise alignment score is 0.93, and many 
sequences being perfectly aligned with a pairwise align- 
ment score of 1.00. The results are shown graphically in 
Figure 7 for a section of the 200 bases that are perfectly 
aligned. 

A second example comes for the species Cetobacter- 
ium somerae that belongs to the phylum Fusobacteria 
and genus Cetobacterium. There are 207 sequences in 
the dataset for this species having lengths between 1335 
and 1472 bases. We created the GenModel shown in 
Figure 8. Since there are about 20 times more sequences 
than for the earlier, the model is more complex and has 
many more clusters and related transitions. Still, we can 
see that there are certain clusters and transitions that 
are more pronounced and the most common path is 
28 ^ 29 ^ 36 ^ ... 26 ^ 27 ^ 35. The plot of the seg- 
ments belonging to the top 5 quasi-alignments are shown 
in Figure 9 at the top and the consensus for the quasi- 
alignments is shown at the bottom. It is easy to see that 
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the consensus mostly peaks at the location of the hyper- 
variable regions, implying that these segments are similar 
and cluster together in the same quasi-alignment. We 
have also performed MSA on the segments belonging to 
the largest clusters and they have very high nucleotide 
similarity with several sub-regions having perfect 
matches. 

Applications 

There are several possible applications of our method. 
The first is to discover regions of very high sequence 
alignment by limiting the search space to regions of 
strong quasi-alignment. For the case of the species 



Leptotrichia buccalis, we have identified in the sequences 
that the region between nucleotides 100 and 300 has a 
highly similar word frequency distribution. While this 
does not necessarily mean that all bases in this region 
will be perfectly aligned, it does indicate that this region 
is a good candidate for alignment. Therefore, the search 
space for the best alignment can be reduced from the 
entire sequence length to just the strongly quasi-aligned 
segments. This can result in substantial savings in com- 
putational resources and time and produce results more 
efficiently. 

Another application is in the area of DNA barcoding 
[16], which seeks to identify species based on sequence 
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Figure 6 Plot of the top 5 quasl-allgnments from the species Leptotrichia buccalis The top plot shows the 5 strongest quasi-alignments 
from the species Leptotrichia buccalis. The bottom plot shows the consensus of the quasi-alignments. 



segments that are standardized and well-conserved aligned. For example, in case of the species Leptotrichia 

across sequences belonging to the same species. By buccalis, we discovered the region between nucleotide 

using our methods, we can limit the search for DNA base positions 100-300 contain highly similar sequences, 

barcodes to those regions that are strongly quasi- Further analysis of MSA results reveals that in this 



y 31 41 51 61 

11296/1-200 CCCf^CAWACAlWWATAACAlACfrtAAActACTTlATAAf ACCT^Af A 

2S0J99/2-2(>0 CCCTCCACACACCCATAAC ACACCCAAACCACTCATAATACCTCATA 
469400/1-200 C C C T C C AC A C ACCC AT A A C ACA CCC A A A CC A C T C AT A AT A C CTC AT A 
SOJOSJ/2-200 CCCTCCACACACCCATAACACACCCAAACCACTCATAATACCTCATA 
S0ii00/J-2(W CCCTCCACACACCCATAACACACCCAAACCACTCATAATACCTCATA 
S29SS7/2-2W CCCTCC AC AC ACCC AT A AC ACACCCA A ACCACTCAT A AT AC CTCAT A 
529SSS/2-200 CCCTCCACAC ACCCAT A AC ACACCCA A ACCACTCAT A AT AC CTCAT A 
S29S60/J-2W CCCTCC ACACACCCATAAC ACACCCAAACCACTCATAATACCTCATA 
529g6i/i-200 CCCTGC ACA CA CCC AT A AC ACA CCCA A ACCACTCAT A AT AC CTCAT A 
5J506^/J-200CCCH|CA|ACA|HHAiAACACACfifiAAACfiiACTCATAATACCTCATA 



Consensus 




CCCTCCACACACCCATAACACACCCAAACCACTCATAATACCTCATA 

Figure 7 MSA of a section of quasi-alignment 2 showing perfectly aligned bases Plot of the Multiple Sequence Alignment of a section of 
quasi-alignment 2 between bases 122 and 158 for the species Leptotrichia buccalis showing perfect alignment. 
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Figure 8 GenModel of 16S rRNA sequences from the species Cetobacterium somerae. 



region the nucleotide positions 122-168 and 183-235 are 
exactly identical. These regions can form the basis of a 
more thorough DNA barcoding analysis. 

Run time analysis 

The existing methods for analyzing a group of sequences 
for similarity rely mostly on Multiple Sequence Alignment 
(MSA). Finding the optimal MSA is known to be NP hard 
and thus computationally challenging [33,34]. Various 
heuristics are currently used for MSA based analysis. 
Progressive alignment is a heuristic method that first con- 
structs a guide tree based on relationships between the 
sequences and then builds the MSA by iteratively adding 
sequences from the guide tree to the alignment. The time 
complexity of progressive alignment is 0{N^I?) where N 
the number of sequences having average length L [35]. 

In contrast to the above methods, position sensitive p- 
mer clustering makes just one pass through each of the 
sequences to create the NSVs and construct the GenMo- 
dels. Thus, the time complexity of our method is 0{LN) 
for N sequences of average length L. In addition, adding 



new sequences to an existing model is very easy since the 
use data stream clustering allows us to add new NSVs at 
any time. 

The above advantages allow us to analyze a large set of 
sequences for similarity and allow easy discovery of con- 
served regions. Our algorithm can easily analyze the entire 
data set from the Greengenes [22] using a simple personal 
computer. Performing such an analysis using traditional 
MSA would require extensive server resources and com- 
puting time. 

To compare run times of our method against MSA, 
we incrementally sampled between 10 and 100 
sequences from the phylum Fusobacteria and ran quasi- 
alignment and the Clustal [8] implementation of MSA 
on them and noted the run times. The plot is shown in 
Figure 10. It is clear that the run time for quasi-align- 
ment increases linearly with the number of sequences 
while for Clustal it grows polynomially. Because of this, 
quasi-alignment scales well for larger number of 
sequences and can provide accurate results quickly and 
efficiently. 
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Figure 10 Comparing run times of quasi-alignment and Clustal 
implementation of MSA 



Conclusion 

In this work, we have presented the foundation for quasi- 
ahgnment based on position sensitive p-mev clustering, a 
technique which applies high- throughput data stream clus- 
tering to produce GenModels, where strong clusters repre- 
sent potential areas of high sequence similarity. In contrast 
to MSA heuristics, the runtime of quasi-alignment scales 
linearly in the number of sequences and the average 
sequence length. This allows us to process larger number 
of sequences efficiently. We carried out experiments for 
sequences consisting of single and multiple species and 
verified the accuracy of our results by comparing them to 
traditional MSA and using biological evidence from the 
hypervariable regions. 

There are many possible applications such as identifi- 
cation of identical DNA fragments and their positions 
within multiple sequences for DNA barcoding studies. 
Our methods can reduce the search space from the entire 
length of DNA sequences to just those regions that are 
part of stronger quasi-alignments. Other applications 
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might include identification of sequences from their 
quasi-alignment models and finding interesting regions 
within sequences, such as those with high GC content. 

In this paper we have restricted our discussion to creating 
non-overlapping segments. For dealing with sequences 
which contain a larger amount of insertions and deletions or 
for classification of shorter fragments sampled randomly 
from the sequence, it is necessary to use overlapping 
segments while constructing GenModels. The runtime 
complexity increases only by the constant factor I, the seg- 
ment length. We are currently working on expanding the 
QuasiAlign package to support use of overlapping segments. 
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